Consistent Weighted Sampling Made More Practical

Authors

  • Wei Wu
  • Bin Li
  • Ling Chen
  • Chengqi Zhang
Abstract

Min-Hash, which is widely used to efficiently estimate the similarity of data represented as bags of words, plays an increasingly important role in the era of big data. It has been extended to real-valued weighted sets, and Improved Consistent Weighted Sampling (ICWS) is considered the state of the art for this problem. In this paper, we propose a Practical CWS (PCWS) algorithm. We first transform the original form of ICWS into an equivalent expression, from which we derive properties that allow us to make the ICWS algorithm simpler and more efficient in both space and time complexity. PCWS is mathematically equivalent to ICWS and preserves the same theoretical properties, yet it reduces the memory footprint by 20% and saves substantial computational cost compared to ICWS. Experimental results on a number of real-world text data sets demonstrate that PCWS matches, and in some cases exceeds, the classification and retrieval performance of ICWS while reducing empirical runtime by 1/5 to 1/3.
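For readers unfamiliar with the baseline, the following is a minimal sketch of ICWS as it is commonly described in the literature (Ioffe's formulation), not the PCWS variant proposed in this paper and not code taken from it. The function names `icws_sketch`, `estimate_similarity`, and `generalized_jaccard`, the dictionary representation of a weighted set, and the string-based per-element seeding are all assumptions made for illustration. Each element draws r, c ~ Gamma(2, 1) and beta ~ Uniform(0, 1), and the fraction of matching samples between two sketches estimates the generalized Jaccard similarity.

```python
import math
import random


def icws_sketch(weights, num_hashes=256, seed=0):
    """Return a list of (element, t) samples for a weighted set.

    `weights` maps element -> positive weight (illustrative representation).
    """
    positive = {k: w for k, w in weights.items() if w > 0}
    if not positive:
        raise ValueError("weighted set has no positive weights")
    sketch = []
    for i in range(num_hashes):
        best = None
        for key, w in positive.items():
            # Seeding by (seed, hash index, element) makes the random draws
            # consistent across sets, which the scheme relies on.
            rng = random.Random(f"{seed}|{i}|{key}")
            r = rng.gammavariate(2.0, 1.0)
            c = rng.gammavariate(2.0, 1.0)
            beta = rng.random()
            t = math.floor(math.log(w) / r + beta)
            # ln a = ln c - ln y - r, with ln y = r * (t - beta)
            ln_a = math.log(c) - r * (t - beta) - r
            if best is None or ln_a < best[0]:
                best = (ln_a, key, t)
        sketch.append((best[1], best[2]))
    return sketch


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of positions where both sketches hold the same (element, t)."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


def generalized_jaccard(a, b):
    """Exact sum-of-minima over sum-of-maxima, for comparison."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den


if __name__ == "__main__":
    doc_a = {"hash": 2.0, "weight": 1.0, "sample": 0.5}
    doc_b = {"hash": 1.0, "weight": 1.0, "set": 3.0}
    est = estimate_similarity(icws_sketch(doc_a, 1000), icws_sketch(doc_b, 1000))
    print("estimate:", est, "exact:", generalized_jaccard(doc_a, doc_b))
```

The sketch stores three random draws per element and hash function; the abstract's claim of a 20% memory saving concerns this kind of per-element random state, although the exact way PCWS achieves it is described in the full paper rather than here.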


Related papers

Using retrospective sampling to estimate models of relationship status in large longitudinal social networks

Estimation of longitudinal models of relationship status between all pairs of individuals (dyads) in social networks is challenging due to the complex inter-dependencies among observations and lengthy computation times. To reduce the computational burden of model estimation, a method is developed that subsamples the "always-null" dyads in which no relationships develop throughout the period of ...


Exponential Approximation of Bandlimited Functions from Average Oversampling

Weighted average sampling is more practical and numerically more stable than sampling at single points as in the classical Shannon sampling framework. Using the frame theory, one can completely reconstruct a bandlimited function from its suitably-chosen average sample data. When only finitely many sample data are available, truncating the complete reconstruction series with the standard dual fr...


Generalized sampling: a variational approach .II. Applications

The variational reconstruction theory from a companion paper finds a solution consistent with some linear constraints and minimizing a quadratic plausibility criterion. It is suitable for treating vector and multidimensional signals. Here, we apply the theory to a generalized sampling system consisting of a multichannel filterbank followed by a nonuniform sampling. We provide ready-made formula...


Uniform Error Bounds for Reconstruct Functions from Weighted Bernstein Class

Errors appear when the Shannon sampling series is applied to reconstruct a signal in practice. This is because the sampled values may not be exact, or the sampling series may have to be truncated. In this paper, we study errors in truncated sampling series with localized sampling for band-limited functions from weighted Bernstein class. And we apply these results to some practical examples.


Weighted Likelihood for Semiparametric Models and Two-phase Stratified Samples, with Application to Cox Regression

Weighted likelihood, in which one solves Horvitz-Thompson or inverse probability weighted (IPW) versions of the likelihood equations, offers a simple and robust method for fitting models to two-phase stratified samples. We consider semiparametric models for which solution of infinite dimensional estimating equations leads to √N-consistent and asymptotically Gaussian estimators of both Euclidea...




Publication date: 2017